Instructions:
Before submitting,
import scipy, os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
# add more imports here if you like
# ...
# # if you use Google Colab:
# from google.colab import drive
# drive.mount('/content/drive')
# change this line to the folder where the data is found
# basedir = "C:/Users/andre/Dropbox/MSc Data Science and Analytics/Computational Data Science/Coursework 2"
basedir = os.getcwd()
In this part you will be working with the listings.csv data. To help you get your head around it, we first provide some information on the main columns in the data.
Dataframe columns description:
id - unique ID identifying the listing
name - title of the listing
host_id - unique ID for a host
host_name - first name of the host
host_since - date that the host first joined Airbnb
host_is_superhost - whether or not the host is a superhost, which is a mark of quality for the top-rated and most experienced hosts, and can increase your search ranking on Airbnb
host_listings_count - how many listings the host has in total
host_has_profile_pic - whether or not the host has a profile picture
host_identity_verified - whether or not the host has been verified with their passport
neighbourhood_cleansed - the borough the property is in
latitude and longitude - geolocation coordinates of the property
property_type - type of property, e.g. house or flat
room_type - type of listing, e.g. entire home, private room or shared room
accommodates - how many people the property accommodates
bedrooms - number of bedrooms
beds - number of beds
price - nightly advertised price (the target variable)
minimum_nights - the minimum length of stay
maximum_nights - the maximum length of stay
availability_30 - how many nights are available to be booked in the next 30 days
availability_60 - how many nights are available to be booked in the next 60 days
availability_90 - how many nights are available to be booked in the next 90 days
availability_365 - how many nights are available to be booked in the next 365 days
number_of_reviews - the number of reviews left for the property
number_of_reviews_ltm - the number of reviews left for the property in the last twelve months
first_review - the date of the first review
last_review - the date of the most recent review
review_scores_rating - guests can score properties overall from 1 to 5 stars
review_scores_accuracy - guests can score the accuracy of a property's description from 1 to 5 stars
review_scores_cleanliness - guests can score a property's cleanliness from 1 to 5 stars
review_scores_checkin - guests can score their check-in from 1 to 5 stars
review_scores_communication - guests can score a host's communication from 1 to 5 stars
review_scores_location - guests can score a property's location from 1 to 5 stars
review_scores_value - guests can score a booking's value for money from 1 to 5 stars
instant_bookable - whether or not the property can be instant booked (i.e. booked straight away, without having to message the host first and wait to be accepted)
reviews_per_month - calculated field of the average number of reviews left by guests each month
The next two cells load the listings.csv file into a dataframe. Once loaded, start working on the subsequent questions.
### DO NOT CHANGE THIS CELL
def load_csv(basedir):
    return pd.read_csv(os.path.join(basedir, 'listings.csv'))
### DO NOT CHANGE THIS CELL
df = load_csv(basedir)
df.head()
| | id | listing_url | scrape_id | last_scraped | name | description | neighborhood_overview | picture_url | host_id | host_url | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2818 | https://www.airbnb.com/rooms/2818 | 20201212211823 | 2020-12-13 | Quiet Garden View Room & Super Fast WiFi | Quiet Garden View Room & Super Fast WiFi<br />... | Indische Buurt ("Indies Neighborhood") is a ne... | https://a0.muscache.com/pictures/10272854/8dcc... | 3159 | https://www.airbnb.com/users/show/3159 | ... | 10.0 | 9.0 | 10.0 | NaN | t | 1 | 0 | 1 | 0 | 1.95 |
| 1 | 20168 | https://www.airbnb.com/rooms/20168 | 20201212211823 | 2020-12-13 | Studio with private bathroom in the centre 1 | 17th century Dutch townhouse in the heart of t... | Located just in between famous central canals.... | https://a0.muscache.com/pictures/69979628/fd6a... | 59484 | https://www.airbnb.com/users/show/59484 | ... | 10.0 | 10.0 | 9.0 | NaN | t | 2 | 0 | 2 | 0 | 2.58 |
| 2 | 25428 | https://www.airbnb.com/rooms/25428 | 20201212211823 | 2020-12-13 | Lovely apt in City Centre (w.lift) near Jordaan | Lovely apt in Centre ( lift & fireplace) near ... | NaN | https://a0.muscache.com/pictures/138431/7079a9... | 56142 | https://www.airbnb.com/users/show/56142 | ... | 10.0 | 10.0 | 10.0 | NaN | f | 1 | 1 | 0 | 0 | 0.14 |
| 3 | 27886 | https://www.airbnb.com/rooms/27886 | 20201212211823 | 2020-12-13 | Romantic, stylish B&B houseboat in canal district | Stylish and romantic houseboat on fantastic hi... | Central, quiet, safe, clean and beautiful. | https://a0.muscache.com/pictures/02c2da9d-660e... | 97647 | https://www.airbnb.com/users/show/97647 | ... | 10.0 | 10.0 | 10.0 | NaN | t | 1 | 0 | 1 | 0 | 2.01 |
| 4 | 28871 | https://www.airbnb.com/rooms/28871 | 20201212211823 | 2020-12-13 | Comfortable double room | <b>The space</b><br />In a monumental house ri... | Flower market , Leidseplein , Rembrantsplein | https://a0.muscache.com/pictures/160889/362340... | 124245 | https://www.airbnb.com/users/show/124245 | ... | 10.0 | 10.0 | 10.0 | NaN | f | 2 | 0 | 2 | 0 | 2.68 |
5 rows × 74 columns
# Do not rename the function, do not remove the return statement.
# Just add code before the return statement to add the required functionality.
def drop_cols(df):
    '''
    Drops the columns specified in the Question 1a description.
    Args:
        df: The dataframe we want to modify.
    Returns: The dataframe without the columns we specified to drop.
    '''
    # Columns to drop, as listed in Question 1a
    cols_to_drop = [
        'scrape_id', 'last_scraped', 'description', 'listing_url', 'neighbourhood',
        'calendar_last_scraped', 'amenities', 'neighborhood_overview', 'picture_url',
        'host_url', 'host_about', 'host_location', 'host_total_listings_count',
        'host_thumbnail_url', 'host_picture_url', 'host_verifications', 'bathrooms_text',
        'has_availability', 'minimum_minimum_nights', 'maximum_minimum_nights',
        'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm',
        'maximum_nights_avg_ntm', 'number_of_reviews_l30d', 'calculated_host_listings_count',
        'calculated_host_listings_count_entire_homes',
        'calculated_host_listings_count_private_rooms',
        'calculated_host_listings_count_shared_rooms',
    ]
    df.drop(cols_to_drop, inplace=True, axis=1)
    return df
df = drop_cols(df)
df.columns
Index(['id', 'name', 'host_id', 'host_name', 'host_since',
'host_response_time', 'host_response_rate', 'host_acceptance_rate',
'host_is_superhost', 'host_neighbourhood', 'host_listings_count',
'host_has_profile_pic', 'host_identity_verified',
'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
'bedrooms', 'beds', 'price', 'minimum_nights', 'maximum_nights',
'calendar_updated', 'availability_30', 'availability_60',
'availability_90', 'availability_365', 'number_of_reviews',
'number_of_reviews_ltm', 'first_review', 'last_review',
'review_scores_rating', 'review_scores_accuracy',
'review_scores_cleanliness', 'review_scores_checkin',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'license', 'instant_bookable',
'reviews_per_month'],
dtype='object')
print(df.isna().sum())
id                                  0
name                               33
host_id                             0
host_name                          55
host_since                         55
host_response_time              14273
host_response_rate              14273
host_acceptance_rate             9255
host_is_superhost                  55
host_neighbourhood               6203
host_listings_count                55
host_has_profile_pic               55
host_identity_verified             55
neighbourhood_cleansed              0
neighbourhood_group_cleansed    18522
latitude                            0
longitude                           0
property_type                       0
room_type                           0
accommodates                        0
bathrooms                       18522
bedrooms                         1014
beds                              107
price                               0
minimum_nights                      0
maximum_nights                      0
calendar_updated                18522
availability_30                     0
availability_60                     0
availability_90                     0
availability_365                    0
number_of_reviews                   0
number_of_reviews_ltm               0
first_review                     2375
last_review                      2375
review_scores_rating             2620
review_scores_accuracy           2630
review_scores_cleanliness        2630
review_scores_checkin            2638
review_scores_communication      2631
review_scores_location           2636
review_scores_value              2636
license                         18522
instant_bookable                    0
reviews_per_month                2375
dtype: int64
def drop_cols_na(df, threshold = 0.5):
    '''
    Drops columns according to the proportion of NaN values they contain.
    Args:
        df: The dataframe we want to modify.
        threshold: The limit on the proportion of NaNs. A column is dropped if its proportion of NaNs is strictly higher than this value.
    Returns: The dataframe without the columns whose proportion of NaNs is higher than the threshold.
    '''
    # Number of rows of the dataset
    rows = len(df)
    # Pairs of (column name, number of NaNs in that column)
    column_na = list(zip(df.columns, df.isna().sum()))
    for column in column_na:
        # column[1] / rows is the proportion of NaNs for the column
        if column[1] / rows > threshold:
            df.drop(column[0], inplace=True, axis=1)
    return df
df = drop_cols_na(df)
df.columns
Index(['id', 'name', 'host_id', 'host_name', 'host_since',
'host_acceptance_rate', 'host_is_superhost', 'host_neighbourhood',
'host_listings_count', 'host_has_profile_pic', 'host_identity_verified',
'neighbourhood_cleansed', 'latitude', 'longitude', 'property_type',
'room_type', 'accommodates', 'bedrooms', 'beds', 'price',
'minimum_nights', 'maximum_nights', 'availability_30',
'availability_60', 'availability_90', 'availability_365',
'number_of_reviews', 'number_of_reviews_ltm', 'first_review',
'last_review', 'review_scores_rating', 'review_scores_accuracy',
'review_scores_cleanliness', 'review_scores_checkin',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'instant_bookable', 'reviews_per_month'],
dtype='object')
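The same NaN-proportion filter can be written without an explicit loop. The following is only a minimal sketch on a made-up dataframe: `drop_sparse_columns` and `toy` are hypothetical names that do not appear in the coursework, and unlike `drop_cols_na` the helper returns a new dataframe instead of modifying its argument.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, threshold=0.5):
    """Hypothetical helper: keep columns whose share of NaNs is at most `threshold`."""
    # isna().mean() gives the proportion of missing values per column
    na_share = frame.isna().mean()
    return frame.loc[:, na_share <= threshold]

# Made-up data for illustration only
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],  # 75% missing, dropped
    "b": [1.0, 2.0, np.nan, np.nan],     # exactly 50% missing, kept
    "c": [1.0, 2.0, 3.0, 4.0],           # no missing values, kept
})
kept = drop_sparse_columns(toy)
print(list(kept.columns))
```

A column that sits exactly on the threshold is kept, which matches the strict `>` comparison used in `drop_cols_na`.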
def binary_encoding(df):
    '''
    Recodes every column whose only values are the strings 't' (for True) and 'f' (for False) into the integers 1 and 0.
    Args:
        df: The dataframe we want to modify.
    Returns: The dataframe with every 't'/'f' column recoded to 1/0. Missing values are left untouched.
    '''
    for column in df.columns:
        # The set of distinct non-missing values of the column
        values = set(df[column].dropna().unique())
        # This condition is true only for columns that hold exactly the strings 't' and 'f'
        if values == {'t', 'f'}:
            # .loc assigns on the dataframe itself, which avoids the SettingWithCopyWarning
            # raised by chained indexing such as df[column][index] = 1
            df.loc[df[column] == 't', column] = 1
            df.loc[df[column] == 'f', column] = 0
    return df
df = binary_encoding(df)
# hint: check Pandas to_datetime method
def add_host_days(df):
    '''
    Inserts a column called 'host_days' that holds the number of days (with respect to the current date) that the host has been registered.
    Args:
        df: The dataframe we want to modify.
    Returns: The dataframe with the new column 'host_days'. Hosts without a valid 'host_since' date get NaN.
    '''
    today = pd.to_datetime('today').normalize()
    # errors='coerce' turns missing or unparseable 'host_since' values into NaT,
    # so the subtraction yields NaN for those hosts instead of raising an error
    df['host_days'] = (today - pd.to_datetime(df['host_since'], errors='coerce')).dt.days
    return df
def convert_price(df):
    '''
    Converts the values of column 'price' from strings of the form '$XX' (possibly with thousands separators) to floats.
    Args:
        df: The dataframe we want to modify.
    Returns: The dataframe with the modified column 'price'.
    '''
    # Remove the '$' sign and the thousands separators before casting to float
    df['price'] = df['price'].str.replace('[$,]', '', regex=True).astype(float)
    return df
df = add_host_days(df)
df = convert_price(df)
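To make the two conversions concrete, here is a small sketch on made-up values. The dataframe `toy` and the fixed `reference_date` are assumptions for illustration; the coursework function uses the current date, so its numbers change from day to day.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "price": ["$1,250.00", "$80.00"],
    "host_since": ["2020-12-31", None],
})

# Assumed fixed reference date so that the result is reproducible
reference_date = pd.Timestamp("2021-01-01")

# Strip '$' and ',' and cast the remaining digits to float
toy["price"] = toy["price"].str.replace("[$,]", "", regex=True).astype(float)
# Missing dates become NaT, so the day count for that host is NaN
toy["host_days"] = (reference_date - pd.to_datetime(toy["host_since"])).dt.days
print(toy)
```

The missing `host_since` value ends up as NaN in `host_days`, which is why that column later shows the same number of missing entries as `host_since`.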
You do not need to write the answer. In each cell, provide the Pandas code that outputs the result. Each answer can be given with 1-2 lines of Python code. Example question and answer:
# What is the total number of rows in the dataframe?
df.shape[0]
Now over to you:
# How many hosts offer 2 or more properties for rent?
(df['host_id'].value_counts() > 1).sum()
1331
# What is the highest price for a listing?
df['price'].max()
8000.0
# What is the ID of the listing that has the largest number of bedrooms?
df[df['bedrooms'] == df['bedrooms'].max()].iloc[0]['id']
46015289
# What is the ID of the listing with the largest advertised price?
df[df['price'] == df['price'].max()].iloc[0]['id']
258273
# There are different room types. How many listings are there for the most common room type?
df['room_type'].value_counts().max()
14433
# How many hosts are there that have been registered for more than 3000 days?
len(pd.unique(df['host_id'][df['host_days'] > 3000]))
2400
Produce a barplot of the average nightly price per neighbourhood as instructed in the Coursework proforma:
sns.set_context('paper')
f, ax = plt.subplots(figsize = (9,18))
sns.set_color_codes('pastel')
sns.barplot(x = 'price', y = 'neighbourhood_cleansed', data = df,
label = 'neighbourhood_cleansed', color = 'b', edgecolor = 'w')
plt.show()
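By default, `sns.barplot` draws the mean of the x variable for each category, so the bar lengths above correspond to a per-neighbourhood average. The sketch below computes that average directly on a made-up dataframe; `toy` and the neighbourhood labels "A" and "B" are hypothetical.

```python
import pandas as pd

# Made-up listings for illustration only
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "B"],
    "price": [200.0, 100.0, 90.0],
})

# Average nightly price per neighbourhood, most expensive first
mean_price = (
    toy.groupby("neighbourhood_cleansed")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_price)
```

If sorted bars are wanted, the index of such a series could be passed to the `order` argument of `sns.barplot`.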
Plot a correlation matrix as instructed in the Coursework proforma:
ax = sns.heatmap(df[['review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value']].corr())
Plot a geographical distribution as instructed in the Coursework proforma:
# To group the values, we assign one representative value to every price that falls inside each of the ranges we want to display in the scatterplot
# .copy() gives an independent dataframe and .loc assigns on it directly, which avoids the SettingWithCopyWarning
mydata = df[df.price>150].copy()
mydata.loc[mydata.price>6000, 'price'] = 7500
mydata.loc[(mydata.price>4500) & (mydata.price<=6000), 'price'] = 6000
mydata.loc[(mydata.price>3000) & (mydata.price<=4500), 'price'] = 4500
mydata.loc[(mydata.price>1500) & (mydata.price<=3000), 'price'] = 3000
mydata.loc[mydata.price<=1500, 'price'] = 1500
sns.set_theme(style="white")
f, ax = plt.subplots(figsize = (15,10))
yo = sns.scatterplot(data=mydata, x="longitude", y="latitude", hue="price", size="price",
sizes=(5, 500), hue_norm=(0, 100), alpha=1, palette="muted", legend = "full")
plt.show()
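The price grouping used for this plot can also be expressed as a single binning step. This is a sketch under the assumption that the same ranges are wanted; `prices`, `edges` and `labels` are hypothetical names, and the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up prices, all above the 150 cut-off used for the plot
prices = pd.Series([160.0, 1500.0, 1501.0, 4500.0, 6000.0, 8000.0])

# Right-closed intervals: (150, 1500], (1500, 3000], ..., (6000, inf)
edges = [150, 1500, 3000, 4500, 6000, np.inf]
# Representative value shown in the legend for each interval
labels = [1500, 3000, 4500, 6000, 7500]

binned = pd.cut(prices, bins=edges, labels=labels).astype(int)
print(binned.tolist())
```

Because `pd.cut` closes each interval on the right by default, a price of exactly 1500 falls in the first group and 1501 in the second, as with the `<=` comparisons in the cell above.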
# Performing a linear regression on rating using statsmodels.
fit = sm.OLS.from_formula('review_scores_rating ~ review_scores_accuracy + review_scores_cleanliness + review_scores_checkin + review_scores_communication + review_scores_location + review_scores_value', df).fit()
# With xname we are making sure that the variable names shown in the summary are short and readable
print(fit.summary(xname=['Intercept','accuracy', 'cleanliness', 'checkin', 'communication', 'location', 'value']))
OLS Regression Results
================================================================================
Dep. Variable: review_scores_rating R-squared: 0.726
Model: OLS Adj. R-squared: 0.726
Method: Least Squares F-statistic: 7008.
Date: Fri, 04 Jun 2021 Prob (F-statistic): 0.00
Time: 03:34:14 Log-Likelihood: -42957.
No. Observations: 15880 AIC: 8.593e+04
Df Residuals: 15873 BIC: 8.598e+04
Df Model: 6
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept -0.3169 0.550 -0.577 0.564 -1.394 0.760
accuracy 2.6153 0.056 46.889 0.000 2.506 2.725
cleanliness 2.1438 0.041 52.279 0.000 2.063 2.224
checkin 1.0232 0.060 16.913 0.000 0.905 1.142
communication 1.8090 0.066 27.490 0.000 1.680 1.938
location 0.3988 0.045 8.858 0.000 0.311 0.487
value 1.9905 0.047 42.354 0.000 1.898 2.083
==============================================================================
Omnibus: 4769.894 Durbin-Watson: 1.991
Prob(Omnibus): 0.000 Jarque-Bera (JB): 90176.958
Skew: -0.965 Prob(JB): 0.00
Kurtosis: 14.514 Cond. No. 450.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
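For readers who want to see what the `coef` column and `R-squared` in the summary stand for, here is a minimal sketch of ordinary least squares using NumPy only. The data are made up and noise-free, so the recovered coefficients are known in advance; none of the names below come from the coursework.

```python
import numpy as np

# Made-up predictors and a noise-free response: y = 1 + 2*x1 + 0.5*x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 1.0 + 2.0 * x1 + 0.5 * x2

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimate of [intercept, coefficient of x1, coefficient of x2]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared: share of the variance of y explained by the fitted values
residuals = y - X @ coef
r_squared = 1 - (residuals @ residuals) / ((y - y.mean()) @ (y - y.mean()))
print(coef, r_squared)
```

With real data the residuals are not zero, which is why the model above reports an R-squared of 0.726 rather than 1.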
# Creating a list with the names of room_types existing in our dataframe
room_types = list(df.room_type.value_counts().index)
def multiple_t_tests(df, alpha = 0.01, create_df = False, room_types = room_types):
    '''
    Performs two-sided independent t-tests on the nightly price for every ordered pair of distinct room types.
    Args:
        df : The dataframe whose room types we compare.
        alpha: The significance level used for each t-test.
        create_df: Boolean value. If True, the p-values of all pairwise combinations are also returned, so that a dataframe with 4 rows and 4 columns can be built from them.
        room_types: The room types that exist in our dataset.
    Returns: If create_df is True, a tuple (p_value_list, significantly_different_price); otherwise only significantly_different_price.
    '''
    # Store the combinations of room types that are different, to use them in the answer later
    significantly_different_price = []
    # One tuple of p-values per room type (None on the diagonal)
    p_value_list = []
    for type1 in room_types:
        p_value = []
        for type2 in room_types:
            if type1 == type2:
                p_value.append(None)
                continue
            print(f'\n{type1} --- {type2}\n\n')
            prices_1 = df[df.room_type == type1].price
            prices_2 = df[df.room_type == type2].price
            # This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values.
            res = scipy.stats.ttest_ind(prices_1, prices_2)
            p_value.append(res.pvalue)
            if res.pvalue < alpha:
                print(f'P-value: {res.pvalue:.19f} < {alpha}')
                print(f'T-statistic: {res.statistic:.3f}')
                print(f'\'{type1}\' room type has different prices to \'{type2}\' room type (at a significance level of alpha={alpha}).')
                # Record each pair only once, regardless of the order in which it is tested
                if not any((type1 in record) and (type2 in record) for record in significantly_different_price):
                    significantly_different_price.append(f'The price for \'{type1}\' is different to \'{type2}\'')
            else:
                print(f'P-value: {res.pvalue:.19f} > {alpha}')
                print(f'T-statistic: {res.statistic:.3f}')
                print(f'\'{type1}\' room type does not have different prices to \'{type2}\' room type (at a significance level of alpha={alpha}).')
        p_value_list.append(tuple(p_value))
    if create_df:
        return p_value_list, significantly_different_price
    return significantly_different_price
p_value_list, significantly_different_price = multiple_t_tests(df, 0.01, True)
Entire home/apt --- Private room P-value: 0.0000000000000000000 < 0.01 T-statistic: 29.275 'Entire home/apt' room type has different prices to 'Private room' room type (at a significance level of alpha=0.01). Entire home/apt --- Hotel room P-value: 0.0029292208875053489 < 0.01 T-statistic: 2.976 'Entire home/apt' room type has different prices to 'Hotel room' room type (at a significance level of alpha=0.01). Entire home/apt --- Shared room P-value: 0.0021768318613784249 < 0.01 T-statistic: 3.066 'Entire home/apt' room type has different prices to 'Shared room' room type (at a significance level of alpha=0.01). Private room --- Entire home/apt P-value: 0.0000000000000000000 < 0.01 T-statistic: -29.275 'Private room' room type has different prices to 'Entire home/apt' room type (at a significance level of alpha=0.01). Private room --- Hotel room P-value: 0.0000596655232293761 < 0.01 T-statistic: -4.018 'Private room' room type has different prices to 'Hotel room' room type (at a significance level of alpha=0.01). Private room --- Shared room P-value: 0.4516928878964464600 > 0.01 T-statistic: -0.753 'Private room' room type does not have different prices to 'Shared room' room type (at a significance level of alpha=0.01). Hotel room --- Entire home/apt P-value: 0.0029292208875053489 < 0.01 T-statistic: -2.976 'Hotel room' room type has different prices to 'Entire home/apt' room type (at a significance level of alpha=0.01). Hotel room --- Private room P-value: 0.0000596655232293761 < 0.01 T-statistic: 4.018 'Hotel room' room type has different prices to 'Private room' room type (at a significance level of alpha=0.01). Hotel room --- Shared room P-value: 0.3121192503013984210 > 0.01 T-statistic: 1.013 'Hotel room' room type does not have different prices to 'Shared room' room type (at a significance level of alpha=0.01). 
Shared room --- Entire home/apt P-value: 0.0021768318613784249 < 0.01 T-statistic: -3.066 'Shared room' room type has different prices to 'Entire home/apt' room type (at a significance level of alpha=0.01). Shared room --- Private room P-value: 0.4516928878964464600 > 0.01 T-statistic: 0.753 'Shared room' room type does not have different prices to 'Private room' room type (at a significance level of alpha=0.01). Shared room --- Hotel room P-value: 0.3121192503013984210 > 0.01 T-statistic: -1.013 'Shared room' room type does not have different prices to 'Hotel room' room type (at a significance level of alpha=0.01).
bonferroni_p_value_list, bonferroni_significantly_different_price = multiple_t_tests(df, 0.01/12, True, room_types)
Entire home/apt --- Private room P-value: 0.0000000000000000000 < 0.0008333333333333334 T-statistic: 29.275 'Entire home/apt' room type has different prices to 'Private room' room type (at a significance level of alpha=0.0008333333333333334). Entire home/apt --- Hotel room P-value: 0.0029292208875053489 > 0.0008333333333333334 T-statistic: 2.976 'Entire home/apt' room type does not have different prices to 'Hotel room' room type (at a significance level of alpha=0.0008333333333333334). Entire home/apt --- Shared room P-value: 0.0021768318613784249 > 0.0008333333333333334 T-statistic: 3.066 'Entire home/apt' room type does not have different prices to 'Shared room' room type (at a significance level of alpha=0.0008333333333333334). Private room --- Entire home/apt P-value: 0.0000000000000000000 < 0.0008333333333333334 T-statistic: -29.275 'Private room' room type has different prices to 'Entire home/apt' room type (at a significance level of alpha=0.0008333333333333334). Private room --- Hotel room P-value: 0.0000596655232293761 < 0.0008333333333333334 T-statistic: -4.018 'Private room' room type has different prices to 'Hotel room' room type (at a significance level of alpha=0.0008333333333333334). Private room --- Shared room P-value: 0.4516928878964464600 > 0.0008333333333333334 T-statistic: -0.753 'Private room' room type does not have different prices to 'Shared room' room type (at a significance level of alpha=0.0008333333333333334). Hotel room --- Entire home/apt P-value: 0.0029292208875053489 > 0.0008333333333333334 T-statistic: -2.976 'Hotel room' room type does not have different prices to 'Entire home/apt' room type (at a significance level of alpha=0.0008333333333333334). Hotel room --- Private room P-value: 0.0000596655232293761 < 0.0008333333333333334 T-statistic: 4.018 'Hotel room' room type has different prices to 'Private room' room type (at a significance level of alpha=0.0008333333333333334). 
Hotel room --- Shared room P-value: 0.3121192503013984210 > 0.0008333333333333334 T-statistic: 1.013 'Hotel room' room type does not have different prices to 'Shared room' room type (at a significance level of alpha=0.0008333333333333334). Shared room --- Entire home/apt P-value: 0.0021768318613784249 > 0.0008333333333333334 T-statistic: -3.066 'Shared room' room type does not have different prices to 'Entire home/apt' room type (at a significance level of alpha=0.0008333333333333334). Shared room --- Private room P-value: 0.4516928878964464600 > 0.0008333333333333334 T-statistic: 0.753 'Shared room' room type does not have different prices to 'Private room' room type (at a significance level of alpha=0.0008333333333333334). Shared room --- Hotel room P-value: 0.3121192503013984210 > 0.0008333333333333334 T-statistic: -1.013 'Shared room' room type does not have different prices to 'Hotel room' room type (at a significance level of alpha=0.0008333333333333334).
df_p_value = pd.DataFrame(bonferroni_p_value_list, columns = room_types, index=room_types)
df_p_value
| | Entire home/apt | Private room | Hotel room | Shared room |
|---|---|---|---|---|
| Entire home/apt | NaN | 3.624982e-184 | 0.002929 | 0.002177 |
| Private room | 3.624982e-184 | NaN | 0.000060 | 0.451693 |
| Hotel room | 2.929221e-03 | 5.966552e-05 | NaN | 0.312119 |
| Shared room | 2.176832e-03 | 4.516929e-01 | 0.312119 | NaN |
significantly_different_price
["The price for 'Entire home/apt' is different to 'Private room'", "The price for 'Entire home/apt' is different to 'Hotel room'", "The price for 'Entire home/apt' is different to 'Shared room'", "The price for 'Private room' is different to 'Hotel room'"]
bonferroni_significantly_different_price
["The price for 'Entire home/apt' is different to 'Private room'", "The price for 'Private room' is different to 'Hotel room'"]
T-test questions:
Which room types are significantly different in terms of nightly price?
YOUR ANSWER (1-2 sentences): At alpha = 0.01, nightly prices differ significantly for the pairs 'Entire home/apt' and 'Private room', 'Entire home/apt' and 'Hotel room', 'Entire home/apt' and 'Shared room', and 'Private room' and 'Hotel room'.
Do the significances change if you perform Bonferroni correction to the alpha level: https://en.wikipedia.org/wiki/Bonferroni_correction ?
YOUR ANSWER (1-2 sentences): Yes. After Bonferroni correction (alpha = 0.01/12), only the pairs 'Entire home/apt' and 'Private room', and 'Private room' and 'Hotel room', remain significantly different in terms of nightly price. We divide alpha by 12 because this is the number of ordered room-type combinations that were tested; only 6 of these pairs are distinct, and dividing by 6 instead leads to the same two significant pairs.
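The effect of the correction can be checked with a few lines of plain Python. The sketch below copies the p-values from the table above, counts each unordered pair of room types once, and compares three thresholds. The names `p_values` and `significant_pairs` are hypothetical and not part of the coursework code.

```python
from itertools import combinations

room_types = ["Entire home/apt", "Private room", "Hotel room", "Shared room"]

# p-values copied from the p-value table, one per unordered pair
p_values = {
    ("Entire home/apt", "Private room"): 3.624982e-184,
    ("Entire home/apt", "Hotel room"): 2.929221e-03,
    ("Entire home/apt", "Shared room"): 2.176832e-03,
    ("Private room", "Hotel room"): 5.966552e-05,
    ("Private room", "Shared room"): 4.516929e-01,
    ("Hotel room", "Shared room"): 0.312119,
}

alpha = 0.01
pairs = list(combinations(room_types, 2))  # 6 distinct pairs

def significant_pairs(threshold):
    """Hypothetical helper: pairs whose p-value is below the threshold."""
    return [pair for pair in pairs if p_values[pair] < threshold]

uncorrected = significant_pairs(alpha)
corrected_12 = significant_pairs(alpha / 12)         # divisor used in the notebook
corrected_6 = significant_pairs(alpha / len(pairs))  # one test per distinct pair

print(len(uncorrected), corrected_12, corrected_6)
```

Four pairs are significant without correction and two remain after it, whichever of the two divisors is used.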
Provide a short justification (2-3 sentences) for your choice of variables.
YOUR ANSWER: We exclude the columns 'id', 'name', 'host_id', 'host_name' and 'host_since', as their information would only add bias to our model. We also drop the columns with more than 500 NaNs, because we later remove every row that contains a NaN, and the categorical columns with many distinct values ('neighbourhood_cleansed', 'property_type'), so as not to greatly increase dimensionality. For the remaining columns we compute the VIF and remove those with high multicollinearity.
print(df.isna().sum())
id                                0
name                             33
host_id                           0
host_name                        55
host_since                       55
host_acceptance_rate           9255
host_is_superhost                55
host_neighbourhood             6203
host_listings_count              55
host_has_profile_pic             55
host_identity_verified           55
neighbourhood_cleansed            0
latitude                          0
longitude                         0
property_type                     0
room_type                         0
accommodates                      0
bedrooms                       1014
beds                            107
price                             0
minimum_nights                    0
maximum_nights                    0
availability_30                   0
availability_60                   0
availability_90                   0
availability_365                  0
number_of_reviews                 0
number_of_reviews_ltm             0
first_review                   2375
last_review                    2375
review_scores_rating           2620
review_scores_accuracy         2630
review_scores_cleanliness      2630
review_scores_checkin          2638
review_scores_communication    2631
review_scores_location         2636
review_scores_value            2636
instant_bookable                  0
reviews_per_month              2375
host_days                        55
dtype: int64
list_of_columns = ['host_is_superhost','host_listings_count','host_has_profile_pic','host_identity_verified','latitude','longitude','room_type','accommodates','beds','price','minimum_nights','maximum_nights','availability_30','availability_60','availability_90','availability_365' ,'number_of_reviews','number_of_reviews_ltm','instant_bookable','host_days']
temp_df = df[list_of_columns]
print(temp_df.isna().sum())
host_is_superhost           55
host_listings_count         55
host_has_profile_pic        55
host_identity_verified      55
latitude                    0
longitude                   0
room_type                   0
accommodates                0
beds                        107
price                       0
minimum_nights              0
maximum_nights              0
availability_30             0
availability_60             0
availability_90             0
availability_365            0
number_of_reviews           0
number_of_reviews_ltm       0
instant_bookable            0
host_days                   55
dtype: int64
temp_df = temp_df.dropna()
print(temp_df.isna().sum())
host_is_superhost           0
host_listings_count         0
host_has_profile_pic        0
host_identity_verified      0
latitude                    0
longitude                   0
room_type                   0
accommodates                0
beds                        0
price                       0
minimum_nights              0
maximum_nights              0
availability_30             0
availability_60             0
availability_90             0
availability_365            0
number_of_reviews           0
number_of_reviews_ltm       0
instant_bookable            0
host_days                   0
dtype: int64
room_type, price = temp_df['room_type'], temp_df['price']
temp_df.drop(['room_type','price'], inplace=True, axis=1)
temp_df = pd.get_dummies(temp_df)
temp_df['room_type'], temp_df['price'] = room_type, price
print(temp_df.isna().sum())
host_listings_count         0
latitude                    0
longitude                   0
accommodates                0
beds                        0
minimum_nights              0
maximum_nights              0
availability_30             0
availability_60             0
availability_90             0
availability_365            0
number_of_reviews           0
number_of_reviews_ltm       0
host_days                   0
host_is_superhost_0         0
host_is_superhost_1         0
host_has_profile_pic_0      0
host_has_profile_pic_1      0
host_identity_verified_0    0
host_identity_verified_1    0
instant_bookable_0          0
instant_bookable_1          0
room_type                   0
price                       0
dtype: int64
'''The code is taken from a course of DataCamp
link --> https://campus.datacamp.com/courses/generalized-linear-models-in-python'''
# !pip install statsmodels
from statsmodels.stats.outliers_influence import variance_inflation_factor
def vif_calculator(X):
    # Calculating VIF
    vif = pd.DataFrame()
    vif["Columns"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif
list_of_columns.remove('room_type')
list_of_columns.remove('price')
list_of_columns.remove('host_identity_verified')
list_of_columns.append('host_identity_verified_1')
list_of_columns.remove('host_is_superhost')
list_of_columns.append('host_is_superhost_1')
list_of_columns.remove('host_has_profile_pic')
list_of_columns.append('host_has_profile_pic_1')
list_of_columns.remove('instant_bookable')
list_of_columns.append('instant_bookable_1')
temp_df.drop(['host_is_superhost_0','instant_bookable_0','host_has_profile_pic_0','host_identity_verified_0'], inplace=True, axis=1)
vif = vif_calculator(temp_df[list_of_columns])
vif
| | Columns | VIF |
|---|---|---|
| 0 | host_listings_count | 1.061020 |
| 1 | latitude | 19172.895401 |
| 2 | longitude | 18521.399364 |
| 3 | accommodates | 15.040151 |
| 4 | beds | 6.348844 |
| 5 | minimum_nights | 1.072284 |
| 6 | maximum_nights | 2.282348 |
| 7 | availability_30 | 32.061717 |
| 8 | availability_60 | 210.042704 |
| 9 | availability_90 | 120.927053 |
| 10 | availability_365 | 4.107209 |
| 11 | number_of_reviews | 2.072752 |
| 12 | number_of_reviews_ltm | 1.683283 |
| 13 | host_days | 10.665914 |
| 14 | host_identity_verified_1 | 3.084726 |
| 15 | host_is_superhost_1 | 1.370230 |
| 16 | host_has_profile_pic_1 | 707.684905 |
| 17 | instant_bookable_1 | 1.471855 |
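The very large values for latitude and longitude are partly an artefact of how the VIF is computed here: `variance_inflation_factor` regresses each column on the others exactly as given, without adding an intercept, so columns whose values all lie far from zero (latitude is around 52 for every listing) can appear collinear with almost anything. As a hedged cross-check, the sketch below computes the intercept-adjusted VIF from the inverse of the correlation matrix using only NumPy and pandas. The helper name `centred_vif` and the synthetic data are assumptions for illustration and are not part of the coursework code.

```python
import numpy as np
import pandas as pd


def centred_vif(X):
    """VIF of each column when the auxiliary regressions include an intercept.

    Uses the identity VIF_j = [inv(R)]_jj, where R is the correlation matrix of X.
    """
    corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.DataFrame({"Columns": X.columns, "VIF": vif})


# Synthetic data (illustrative only): two unrelated columns far from zero
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "lat_like": 52.37 + rng.normal(0, 0.02, 500),
    "lon_like": 4.90 + rng.normal(0, 0.03, 500),
})
# A third column that nearly duplicates lat_like, so it is genuinely collinear
demo["lat_copy"] = demo["lat_like"] + rng.normal(0, 0.001, 500)
print(centred_vif(demo))
```

In this toy example only the genuinely duplicated pair receives a large VIF; the unrelated column stays close to 1 even though its values are far from zero.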
list_of_columns = list(vif.Columns)
list_of_columns.remove('latitude')
vif = vif_calculator(temp_df[list_of_columns])
vif
| | Columns | VIF |
|---|---|---|
| 0 | host_listings_count | 1.060980 |
| 1 | longitude | 694.990558 |
| 2 | accommodates | 15.039442 |
| 3 | beds | 6.326988 |
| 4 | minimum_nights | 1.072275 |
| 5 | maximum_nights | 2.281470 |
| 6 | availability_30 | 32.061086 |
| 7 | availability_60 | 210.042684 |
| 8 | availability_90 | 120.927044 |
| 9 | availability_365 | 4.105909 |
| 10 | number_of_reviews | 2.072740 |
| 11 | number_of_reviews_ltm | 1.683272 |
| 12 | host_days | 10.650579 |
| 13 | host_identity_verified_1 | 3.084644 |
| 14 | host_is_superhost_1 | 1.369559 |
| 15 | host_has_profile_pic_1 | 683.385331 |
| 16 | instant_bookable_1 | 1.471577 |
list_of_columns = list(vif.Columns)
list_of_columns.remove('longitude')
vif = vif_calculator(temp_df[list_of_columns])
vif
| | Columns | VIF |
|---|---|---|
| 0 | host_listings_count | 1.060932 |
| 1 | accommodates | 14.962796 |
| 2 | beds | 6.326094 |
| 3 | minimum_nights | 1.072179 |
| 4 | maximum_nights | 2.279227 |
| 5 | availability_30 | 32.061032 |
| 6 | availability_60 | 210.042422 |
| 7 | availability_90 | 120.926431 |
| 8 | availability_365 | 4.105214 |
| 9 | number_of_reviews | 2.072411 |
| 10 | number_of_reviews_ltm | 1.683086 |
| 11 | host_days | 10.573276 |
| 12 | host_identity_verified_1 | 3.084631 |
| 13 | host_is_superhost_1 | 1.369549 |
| 14 | host_has_profile_pic_1 | 18.665684 |
| 15 | instant_bookable_1 | 1.468406 |
list_of_columns = list(vif.Columns)
list_of_columns.remove('availability_60')
vif = vif_calculator(temp_df[list_of_columns])
vif
| | Columns | VIF |
|---|---|---|
| 0 | host_listings_count | 1.060786 |
| 1 | accommodates | 14.962478 |
| 2 | beds | 6.325707 |
| 3 | minimum_nights | 1.072171 |
| 4 | maximum_nights | 2.278752 |
| 5 | availability_30 | 10.661680 |
| 6 | availability_90 | 14.788639 |
| 7 | availability_365 | 3.997886 |
| 8 | number_of_reviews | 2.070701 |
| 9 | number_of_reviews_ltm | 1.681439 |
| 10 | host_days | 10.567594 |
| 11 | host_identity_verified_1 | 3.084619 |
| 12 | host_is_superhost_1 | 1.369525 |
| 13 | host_has_profile_pic_1 | 18.650834 |
| 14 | instant_bookable_1 | 1.468394 |
list_of_columns = list(vif.Columns)
list_of_columns.remove('accommodates')
vif = vif_calculator(temp_df[list_of_columns])
vif
| | Columns | VIF |
|---|---|---|
| 0 | host_listings_count | 1.060710 |
| 1 | beds | 2.498602 |
| 2 | minimum_nights | 1.072142 |
| 3 | maximum_nights | 2.277926 |
| 4 | availability_30 | 10.659190 |
| 5 | availability_90 | 14.782891 |
| 6 | availability_365 | 3.997689 |
| 7 | number_of_reviews | 2.066248 |
| 8 | number_of_reviews_ltm | 1.681072 |
| 9 | host_days | 10.566618 |
| 10 | host_identity_verified_1 | 3.080280 |
| 11 | host_is_superhost_1 | 1.369523 |
| 12 | host_has_profile_pic_1 | 14.870742 |
| 13 | instant_bookable_1 | 1.468083 |
list_of_columns = list(vif.Columns)
list_of_columns.remove('availability_90')
vif = vif_calculator(temp_df[list_of_columns])
vif
| | Columns | VIF |
|---|---|---|
| 0 | host_listings_count | 1.060541 |
| 1 | beds | 2.497864 |
| 2 | minimum_nights | 1.072076 |
| 3 | maximum_nights | 2.276405 |
| 4 | availability_30 | 2.888895 |
| 5 | availability_365 | 2.882094 |
| 6 | number_of_reviews | 2.064745 |
| 7 | number_of_reviews_ltm | 1.680663 |
| 8 | host_days | 10.562973 |
| 9 | host_identity_verified_1 | 3.078783 |
| 10 | host_is_superhost_1 | 1.368959 |
| 11 | host_has_profile_pic_1 | 14.856959 |
| 12 | instant_bookable_1 | 1.466890 |
list_of_columns = list(vif.Columns)
list_of_columns.remove('host_days')
vif = vif_calculator(temp_df[list_of_columns])
vif
| | Columns | VIF |
|---|---|---|
| 0 | host_listings_count | 1.059159 |
| 1 | beds | 2.496097 |
| 2 | minimum_nights | 1.070690 |
| 3 | maximum_nights | 2.272827 |
| 4 | availability_30 | 2.875137 |
| 5 | availability_365 | 2.882093 |
| 6 | number_of_reviews | 1.973532 |
| 7 | number_of_reviews_ltm | 1.645357 |
| 8 | host_identity_verified_1 | 2.981644 |
| 9 | host_is_superhost_1 | 1.367551 |
| 10 | host_has_profile_pic_1 | 6.236564 |
| 11 | instant_bookable_1 | 1.395861 |
list_of_columns = list(vif.Columns)
list_of_columns.remove('host_has_profile_pic_1')
vif = vif_calculator(temp_df[list_of_columns])
vif
| | Columns | VIF |
|---|---|---|
| 0 | host_listings_count | 1.058241 |
| 1 | beds | 1.875701 |
| 2 | minimum_nights | 1.060311 |
| 3 | maximum_nights | 1.766902 |
| 4 | availability_30 | 2.859570 |
| 5 | availability_365 | 2.882091 |
| 6 | number_of_reviews | 1.956231 |
| 7 | number_of_reviews_ltm | 1.635748 |
| 8 | host_identity_verified_1 | 2.221928 |
| 9 | host_is_superhost_1 | 1.362209 |
| 10 | instant_bookable_1 | 1.347454 |
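The manual drop-and-recompute cycle above can also be written as a loop. The sketch below is one common variant, not a reproduction of the steps taken here: it always removes the column with the largest VIF, and it stops at a cut-off of 5, which is an assumption since no cut-off is stated above. The VIF function is passed in as a parameter so that `vif_calculator` could be supplied; the small `toy_vif` helper and the synthetic data exist only to make the example self-contained.

```python
import numpy as np
import pandas as pd


def drop_high_vif(df, columns, vif_func, threshold=5.0):
    """Repeatedly drop the column with the largest VIF until all are <= threshold.

    vif_func must return a dataframe with 'Columns' and 'VIF' columns.
    """
    columns = list(columns)
    while len(columns) > 1:
        vif = vif_func(df[columns])
        worst = vif.loc[vif["VIF"].idxmax()]
        if worst["VIF"] <= threshold:
            break
        print(f"Remove {worst['Columns']} with VIF --> {worst['VIF']:.2f}")
        columns.remove(worst["Columns"])
    return columns


def toy_vif(X):
    # Stand-in for vif_calculator: VIF from the inverse correlation matrix
    corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
    return pd.DataFrame({"Columns": X.columns, "VIF": np.diag(np.linalg.inv(corr))})


# Synthetic data (illustrative only): 'a_dup' nearly duplicates 'a'
rng = np.random.default_rng(1)
toy = pd.DataFrame({"a": rng.normal(size=300), "b": rng.normal(size=300)})
toy["a_dup"] = toy["a"] + rng.normal(scale=0.01, size=300)
remaining = drop_high_vif(toy, toy.columns, toy_vif)
print(remaining)
```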
def variable_selection(df, predictors, target = 'price', alpha = 0.05):
    '''
    Perform variable selection on a given dataset based on some predictors. Fits the regression model to predictors and finds the optimal formula of them.
    Args:
        df: The dataframe holding the predictors and the target.
        predictors: The candidate predictor column names, tried in the given order.
        target: The name of the target column.
        alpha: The significance level used when checking the p-values.
    Returns: A list with 'room_type' followed by the predictors kept by the R ** 2 step.
    '''
    # Initialize the predictors list with the mandatory column 'room_type'.
    pred = ['room_type']
    # Initialize the formula with the mandatory column 'room_type'.
    formula = f'{target} ~ room_type'
    print(formula)
    # Fit with the 'room_type'
    fit = sm.OLS.from_formula(formula, df).fit()
    r2 = fit.rsquared
    print(r2)
    for x in predictors:
        # Concatenate to the formula string a new predictor
        formula += f' + {x}'
        print(formula)
        # Fit with new formula
        fit = sm.OLS.from_formula(formula, df).fit()
        print(fit.rsquared)
        # Check if R ** 2 increased
        if r2 > fit.rsquared:
            # If the new predictor did not increase the R ** 2 then remove it from the formula
            formula = formula.replace(f' + {x}','')
        else:
            # If the new predictor increased the R ** 2 then append it to the predictors list
            pred.append(x)
            r2 = fit.rsquared
    # Flag that stays True while insignificant variables may remain
    low_p_values = True
    # While we have insignificant variables we will iterate
    while low_p_values:
        # Fit the model with formula
        fit = sm.OLS.from_formula(formula, df).fit()
        # print(fit.summary())
        # print(f'\n{[x for x in dict(fit.pvalues).values()]}\n')
        # Create a dict with name of variables and their p-values
        p_values_dict = dict(fit.pvalues)
        # Remove Intercept from the dict because we can not remove it or do anything with it
        p_values_dict.pop('Intercept', None)
        count = 0
        max_p_value = 0
        max_p_value_name = ''
        # Iterate the dict
        for k, v in p_values_dict.items():
            # If the p-value of a variable is bigger than our alpha and it is not one of the room_type variables then store its name and p_value
            if v > alpha and 'room_type' != k[:9]:
                if v > max_p_value:
                    max_p_value = v
                    max_p_value_name = k
            else:
                count += 1
        # If we found a variable with greater p_value than our alpha this will be True
        if max_p_value != 0:
            # Remove the variable with the highest p-value above alpha from the formula
            formula = formula.replace(f'{max_p_value_name} + ','')
            print(f'Remove {max_p_value_name} with p-value --> {max_p_value}')
        # If in any iteration no p-value is greater than our alpha the while loop will stop as we change the value of low_p_values to False
        if count == len(p_values_dict):
            low_p_values = False
    # Return a list with the predictors
    return pred
predictors = variable_selection(temp_df,list_of_columns,'price',0.05)
price ~ room_type
0.043875353084875224
price ~ room_type + host_listings_count
0.04575554158436823
price ~ room_type + host_listings_count + beds
0.1041680604213332
price ~ room_type + host_listings_count + beds + minimum_nights
0.10620950676875407
price ~ room_type + host_listings_count + beds + minimum_nights + maximum_nights
0.10625371303777598
price ~ room_type + host_listings_count + beds + minimum_nights + maximum_nights + availability_30
0.11376318364546889
price ~ room_type + host_listings_count + beds + minimum_nights + maximum_nights + availability_30 + availability_365
0.1151305797806933
price ~ room_type + host_listings_count + beds + minimum_nights + maximum_nights + availability_30 + availability_365 + number_of_reviews
0.1169034855027703
price ~ room_type + host_listings_count + beds + minimum_nights + maximum_nights + availability_30 + availability_365 + number_of_reviews + number_of_reviews_ltm
0.11805339813848625
price ~ room_type + host_listings_count + beds + minimum_nights + maximum_nights + availability_30 + availability_365 + number_of_reviews + number_of_reviews_ltm + host_identity_verified_1
0.11816651015216517
price ~ room_type + host_listings_count + beds + minimum_nights + maximum_nights + availability_30 + availability_365 + number_of_reviews + number_of_reviews_ltm + host_identity_verified_1 + host_is_superhost_1
0.11845580631945207
price ~ room_type + host_listings_count + beds + minimum_nights + maximum_nights + availability_30 + availability_365 + number_of_reviews + number_of_reviews_ltm + host_identity_verified_1 + host_is_superhost_1 + instant_bookable_1
0.11877119329853236
Remove maximum_nights with p-value --> 0.42180556129689617
Remove host_identity_verified_1 with p-value --> 0.16601005183534406
predictors
['room_type', 'host_listings_count', 'beds', 'minimum_nights', 'maximum_nights', 'availability_30', 'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'host_identity_verified_1', 'host_is_superhost_1', 'instant_bookable_1']
def recommend_neighbourhood(df, budget_min, budget_max, relative):
    '''
    This function recommends a neighbourhood to guests who are traveling on a specific budget bracket.
    Args:
        df: The dataframe we want to search on.
        budget_min: The minimum value for budget.
        budget_max: The maximum value for budget.
        relative: Boolean value, if True then the suggestion for the neighbourhood will be based on relative values. Otherwise the suggestion will be based on absolute values.
    Returns: The name of the neighbourhood suggested.
    '''
    # Subset of original dataframe with prices in the range specified by budget_min and budget_max
    results_in_range = df[(df['price'] >= budget_min) & (df['price'] <= budget_max)]
    # Values of opportunities for each neighbourhood in the new subset
    results_in_range_dict = results_in_range['neighbourhood_cleansed'].value_counts()
    # Initializing values for neighbourhood variable that we will return at the end as suggestion for the guest
    neighbourhood = ''
    # Initializing max_value to 0 as it can not be lower or equal to 0
    max_value = 0
    # Values for opportunities for each neighbourhood in the original dataframe to be used if the comparison between neighbourhoods will be based on relative values
    df_dict = df['neighbourhood_cleansed'].value_counts()
    for k, v in results_in_range_dict.items():
        if relative:
            response = myfunction(k, v / df_dict[k], neighbourhood, max_value)
        else:
            response = myfunction(k, v, neighbourhood, max_value)
        neighbourhood = response[0]
        max_value = response[1]
    return neighbourhood
def myfunction(k, v, neighbourhood, max_value):
    '''
    Compares v to max_value to check whether the new neighbourhood ( k ) has higher max_value than the previous ( neighbourhood ).
    Args:
        k: The neighbourhood we want to compare with the ongoing neighbourhood with most opportunities.
        v: The value of opportunities for the neighbourhood we want to compare with the ongoing neighbourhood with most opportunities.
        neighbourhood: The ongoing neighbourhood with most opportunities.
        max_value: The ongoing value of opportunities for the neighbourhood with most opportunities.
    Returns: List with 2 values, the neighbourhood and the max_value, either unchanged or replaced by k and v if v was higher.
    '''
    if v > max_value:
        neighbourhood = k
        max_value = v
    return [neighbourhood, max_value]
recommend_neighbourhood(df, 50, 100, True)
'Gaasperdam - Driemond'
recommend_neighbourhood(df, 50, 100, False)
'De Baarsjes - Oud-West'
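The same recommendation can be expressed with pandas aggregation instead of an explicit loop. This is a minimal sketch, assuming the same column names as above; the function name `recommend_neighbourhood_vectorised` and the toy dataframe are illustrative, and ties may be resolved differently from the loop version.

```python
import pandas as pd


def recommend_neighbourhood_vectorised(df, budget_min, budget_max, relative):
    """Return the neighbourhood with the most (or the largest share of) listings in budget."""
    # between() is inclusive on both ends, like the >= / <= filter in the loop version
    in_range = df[df["price"].between(budget_min, budget_max)]
    counts = in_range["neighbourhood_cleansed"].value_counts()
    if counts.empty:
        return ""
    if relative:
        # Share of each neighbourhood's listings that fall inside the budget
        totals = df["neighbourhood_cleansed"].value_counts()
        counts = counts / totals[counts.index]
    return counts.idxmax()


# Toy data (illustrative only)
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "A", "B"],
    "price": [60, 200, 300, 70, 80],
})
print(recommend_neighbourhood_vectorised(toy, 50, 100, False))
print(recommend_neighbourhood_vectorised(toy, 50, 100, True))
```

In the toy data, neighbourhood A has more listings in the bracket, while B has the larger share of its listings in the bracket, which mirrors the difference between the two calls above.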
def distance(df, latitude, longitude):
    '''
    This function computes the squared Euclidean distance (in degrees) between a given location and every property, and sorts the properties by it.
    Args:
        df: The dataframe we want to search on.
        latitude: The latitude of the location the user wants to take into account.
        longitude: The longitude of the location the user wants to take into account.
    Returns: A copy of the dataframe with a 'distance' column, sorted from the nearest to the farthest property.
    '''
    # Work on a copy so that the caller's dataframe (or a filtered slice of it) is not modified
    df = df.copy()
    df['distance'] = (latitude - df['latitude']) ** 2 + (longitude - df['longitude']) ** 2
    return df.sort_values(by=['distance'])
def recommend_price(df, latitude, longitude, n_neighbours, room_type = None):
    '''
    This function recommends a price for a room / flat / house a host wants to offer on Airbnb.
    Args:
        df: The dataframe we want to search on.
        latitude: The latitude of the property the user wants to take into account.
        longitude: The longitude of the property the user wants to take into account.
        n_neighbours: The number of neighbouring properties the user wants to take into account.
        room_type: If specified, restricts the neighbours search to properties of the given room type.
    Returns: The price recommendation for the property based on the mean price of neighbouring properties.
    '''
    if room_type is None:
        new_df = distance(df, latitude, longitude)
    else:
        new_df = distance(df[df['room_type'] == room_type], latitude, longitude)
    price_recommendation = new_df.price[:n_neighbours].mean()
    return price_recommendation
recommend_price(df, 52, 5, 5)
101.0
recommend_price(df, 52, 5, 50)
80.62
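The `distance` helper treats one degree of latitude and one degree of longitude as equally long, which is only approximately true: at a latitude of about 52 degrees, a degree of longitude covers roughly 60% of the ground distance of a degree of latitude. As a hedged alternative (a swapped-in technique, not the method used above), the sketch below ranks neighbours by haversine great-circle distance. The names `haversine_km` and `recommend_price_haversine`, the Earth radius constant and the toy dataframe are assumptions for illustration, and the dataframe index is assumed to be unique.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def recommend_price_haversine(df, latitude, longitude, n_neighbours, room_type=None):
    """Mean price of the n_neighbours listings closest to the given location."""
    candidates = df if room_type is None else df[df["room_type"] == room_type]
    dist = haversine_km(latitude, longitude, candidates["latitude"], candidates["longitude"])
    nearest = dist.nsmallest(n_neighbours).index
    return candidates.loc[nearest, "price"].mean()


# Toy data (illustrative only)
toy = pd.DataFrame({
    "latitude": [52.0, 52.01, 52.5, 53.0],
    "longitude": [5.0, 5.0, 5.0, 5.0],
    "price": [100, 200, 300, 400],
    "room_type": ["a", "b", "a", "b"],
})
print(recommend_price_haversine(toy, 52.0, 5.0, 2))
print(recommend_price_haversine(toy, 52.0, 5.0, 2, room_type="a"))
```

Because the distances are computed on the fly and never stored, the input dataframe is left unchanged.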